Piecewise polynomials with smoothness constraints.
The red curve is the LOESS smoothed curve:
intro_splines.R Solutions

Copy this at the end of intro_splines.R from the previous lecture:
# SOLUTIONS:
# 1. Note: Anything defined in the ggplot() command trickles down to all the sub
# geoms. If you want to use data/aes() specific to only one particular geom,
# you define it inside:
ggplot(data=NULL, aes(x=x)) +
# What we're given - the obs y:
geom_point(data=real_life_augmented, aes(y=y)) +
# What we won't typically know, but here we pretend to - the f(x) i.e. the signal:
geom_line(data=simulated, aes(y=f), col="red", size=1) +
# Our guess at f(x), i.e. f_hat(x) - our fitted/predicted values:
geom_line(data=real_life_augmented, aes(y=.fitted), col="blue", size=1)
# 2. Let's toy around with df. In particular df=2, 10, 50, 99
real_life_augmented <- smooth.spline(real_life$x, real_life$y, df=50) %>%
augment() %>%
tbl_df()
ggplot(data=NULL, aes(x=x)) +
geom_point(data=real_life_augmented, aes(y=y)) +
geom_line(data=real_life_augmented, aes(y=.fitted), col="blue", size=1) +
geom_line(data=simulated, aes(y=f), col="red", size=1)

What does df do?
- See handout.
- If you’re curious, see source code.
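A sketch of the df comparison on stand-in data (simulated here, since the lecture's `real_life` object lives in intro_splines.R): the training error shrinks as df grows, because a higher-df spline bends closer to the observations.

```r
# Sketch: compare several df values for smooth.spline on toy stand-in data.
# `real_life` below is simulated, not the lecture's object.
set.seed(76)
x <- seq(0, 10, length.out = 100)
real_life <- data.frame(x = x, y = sin(x) + rnorm(100, sd = 0.3))

# Fit one spline per df value and collect fitted values:
fits <- lapply(c(2, 10, 50, 99), function(df_value) {
  fit <- smooth.spline(real_life$x, real_life$y, df = df_value)
  data.frame(x = real_life$x,
             y = real_life$y,
             .fitted = predict(fit, real_life$x)$y,
             df = df_value)
})
fits <- do.call(rbind, fits)

# Training MSE shrinks as df grows: more flexibility chases the noise
fits$sq_error <- (fits$y - fits$.fitted)^2
aggregate(sq_error ~ df, data = fits, FUN = mean)
```

Low df underfits (too smooth to track the signal); high df overfits (wiggles through the noise).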
Add the following to Chapter3_Lab.Rmd from last lecture (better viewed from the HTML document):
# Load packages:
library(tidyverse)
library(MASS)
library(ISLR)
library(broom)
# Fit model. Boston would be the training data in a Kaggle competition.
# i.e. we have the outcome variable median house value
model_SL <- lm(medv~lstat, data=Boston)
# new_values would be the test data in a Kaggle competition
# i.e. we don't have the outcome variable
new_values <- data_frame(lstat=c(5,10,15))
# This is the MSE using the Boston training data:
augment(model_SL) %>%
summarise(MSE=mean((medv-.fitted)^2))
# Or equivalently
augment(model_SL) %>%
summarise(MSE=mean((.resid)^2))
# This value does not exist, b/c we don't have the real y, so we can't compute
# the residuals:
augment(model_SL, newdata = new_values) %>%
summarise(MSE=mean((.resid)^2))

# Fit model
model_full <- lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat, data=Boston)
# See coefficients
tidy(model_full)
# This is the MSE using the Boston training data:
# Or equivalently
augment(model_full) %>%
summarise(MSE=mean((.resid)^2))

# Fit model
model_polynomial <-
lm(medv~
crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+
I(crim^2)+I(zn^2)+I(indus^2)+I(chas^2)+I(nox^2)+I(rm^2)+I(age^2)+I(dis^2)+I(rad^2)+I(tax^2)+I(ptratio^2)+I(black^2)+I(lstat^2),
data=Boston)
# See coefficients
tidy(model_polynomial)
# This is the MSE using the Boston training data:
# Or equivalently
augment(model_polynomial) %>%
summarise(MSE=mean((.resid)^2))

Today we do the following to a simple linear regression in R:
broom Package

As per the tidy tools manifesto, we introduce the broom package:
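A minimal sketch of the three broom verbs on a simple regression, using the built-in mtcars data as a stand-in:

```r
library(broom)

# Fit a simple linear regression on built-in data:
fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)     # one row per coefficient: estimate, std.error, statistic, p.value
glance(fit)   # one row per model: r.squared, AIC, etc.
augment(fit)  # one row per observation: .fitted, .resid, etc.
```

Each verb returns a tidy data frame, so the output drops straight into a dplyr/ggplot2 pipeline.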
Compare:
- Col Archibald Gracie IV
- Patrick O’Keefe
From biology/medicine. Let
- \(Y\) = Blood-pressure
- \(X\) = Dosage of blood-pressure medicine (in ml)
- \(Y = \beta_0 + \beta_1 X\)
From economics/sociology. Let
- \(Y\) = Annual income
- \(X\) = # of years of schooling
- \(Y = \beta_0 + \beta_1 X\)
Fig 5.1 p.177:
Fig 5.3 p.179: Leave-One-Out Cross-Validation
Fig 5.5 p.181: \(k=5\) Fold CV
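The \(k=5\) fold idea can be sketched on the Boston regression from the lab. This is a hand-rolled sketch (random fold assignment is one common approach), not the textbook's code:

```r
library(MASS)  # for the Boston data

set.seed(76)
k <- 5
# Randomly assign each of the 506 rows to one of k folds:
folds <- sample(rep(1:k, length.out = nrow(Boston)))

# For each fold: train on the other k-1 folds, compute test MSE on this fold:
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(medv ~ lstat, data = Boston[folds != i, ])
  y_hat <- predict(fit, newdata = Boston[folds == i, ])
  mean((Boston$medv[folds == i] - y_hat)^2)
})

# The CV estimate of test error is the average of the k fold-specific MSEs:
mean(cv_mse)
```

LOOCV is the special case \(k = n\): every fold is a single observation.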
Formalize what we mean by:
Fig 2.2 p.17:
Fig 2.2 p.17:
- We submitted the following model to Kaggle: Predict

Survived <- ifelse(Sex == "female", 1, 0)

- We got a score (in this case “Categorization Accuracy” AKA proportion correct) of 0.76555.
\[Score = \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i)\] where
- \(I(y_i = \widehat{y}_i) = 1\) if \(y_i = \widehat{y}_i\)
- \(I(y_i = \widehat{y}_i) = 0\) if \(y_i \neq \widehat{y}_i\)
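In R, the indicator and its mean fall out of a logical comparison; a toy example with made-up vectors:

```r
# Toy example: categorization accuracy as the mean of an indicator
y     <- c(0, 1, 1, 0, 1)  # true outcomes
y_hat <- c(0, 1, 0, 0, 1)  # predictions

# (y == y_hat) is TRUE/FALSE; mean() treats TRUE as 1, FALSE as 0
score <- mean(y == y_hat)
score  # 4 of 5 correct: 0.8
```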
- train data (891 rows): used by you to train your model
- test data (418 rows): used by Kaggle to evaluate/score/validate your model; test doesn’t have the outcome variable Survived
- Idea: build pseudo-train and pseudo-test sets via resampling from train, and score yourself on pseudo-test

Create disjoint pseudo_train and pseudo_test data sets using dplyr::anti_join:
library(tidyverse)
# You may need to change your directory path:
train <- readr::read_csv("assets/Titanic/train.csv")
pseudo_train <- train %>%
sample_frac(0.8)
pseudo_test <- train %>%
anti_join(pseudo_train, by="PassengerId")

See RStudio Menu Bar -> Help -> Cheatsheets -> Data Manipulation.
Compute your pseudo-score
pseudo_test %>%
# Create new column of predictions:
mutate(Survived_predicted = ifelse(Sex == "female", 1, 0)) %>%
# Compute score:
summarize(Score = mean(Survived == Survived_predicted))

## # A tibble: 1 × 1
## Score
## <dbl>
## 1 0.8089888
Compare this to Kaggle score of 0.76555. Why are they different?
- They are different b/c of sampling variability. Both at sampling and resampling stages!
- Let’s repeat the above 10 times and get 10 pseudo-scores:
## [1] 0.787 0.775 0.837 0.803 0.792 0.826 0.787 0.781 0.809 0.781
In this case, we have variability due to the resampling. The average of the 10 scores is 0.798
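One way to generate pseudo-scores like those above (a sketch; `train` is simulated stand-in data here so the chunk runs without the Titanic CSV):

```r
library(dplyr)

# Stand-in for the Titanic train data (the real train.csv has 891 rows);
# Sex and Survived are simulated here just to make the sketch runnable:
set.seed(76)
train <- data.frame(
  PassengerId = 1:891,
  Sex = sample(c("female", "male"), 891, replace = TRUE),
  Survived = rbinom(891, 1, 0.4)
)

# Repeat the resample-and-score procedure 10 times:
pseudo_scores <- replicate(10, {
  pseudo_train <- train %>% sample_frac(0.8)
  pseudo_test  <- train %>% anti_join(pseudo_train, by = "PassengerId")
  pseudo_test %>%
    mutate(Survived_predicted = ifelse(Sex == "female", 1, 0)) %>%
    summarize(Score = mean(Survived == Survived_predicted)) %>%
    pull(Score)
})

round(pseudo_scores, 3)
mean(pseudo_scores)
```

The spread of the 10 scores is the resampling variability in action.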
R: tidyverse and the pipe operator %>%

tidyverse

tidyverse is a set of packages that work in harmony because they share common data representations and API design.
install.packages("tidyverse")
library(tidyverse)

Read more on RStudio’s Blog.
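The pipe %>% passes the left-hand side as the first argument of the function on the right; a minimal illustration:

```r
library(dplyr)  # provides the %>% pipe (re-exported from magrittr)

# These two expressions are equivalent; the pipe reads left-to-right
# instead of inside-out:
round(mean(c(1, 2, 3.5)), 1)          # 2.2
c(1, 2, 3.5) %>% mean() %>% round(1)  # 2.2
```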
tidyverse Core Packages

These get loaded when you run library(tidyverse):
- ggplot2 for data visualisation.
- dplyr for data manipulation.
- tidyr for data tidying.
- readr for data import.
- purrr for functional programming.
- tibble for tibbles, a modern re-imagining of data frames.
tidyverse Non-Core Packages

These get installed with tidyverse:

- hms for times.
- stringr for strings.
- lubridate for date/times.
- forcats for factors.
tidyverse Data Import Packages

These get installed with tidyverse:

- DBI for databases.
- haven for SPSS, SAS and Stata files.
- httr for web APIs.
- jsonlite for JSON.
- readxl for .xls and .xlsx files.
- rvest for web scraping.
- xml2 for XML.
tidyverse Modelling Packages

These get installed with tidyverse:

- modelr for simple modelling within a pipeline.
- broom for turning models into tidy data.
If you are new to…
- ggplot2: Package for Data Visualization, do Data Visualization with ggplot2 (Part 1), (Part 2), and (Part 3).
- dplyr: Package for Data Manipulation, do Data Manipulation in R with dplyr and Joining Data in R with dplyr.

For \(i=1,\ldots,n\) passengers:
- PassengerId
- Survived (binary)
- Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

Naïve models: Use no information about the passenger, i.e. no covariates:
- Survived == 0 for everyone
- Survived == 1 for everyone
- Survived == 1 with probability \(p\), Survived == 0 with probability \(1-p\)

What about using Sex?

- Survived == 1 if Sex == "female"
- Survived == 0 if Sex == "male"

Chalk talk, i.e. see your lecture notes.
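The naive models can be scored in a couple of lines. A sketch on a simulated stand-in outcome (on the real train data, roughly 38% of passengers survived):

```r
# Stand-in outcome vector; Survived is simulated here for illustration
set.seed(76)
survived <- rbinom(891, 1, 0.38)

# Naive model: predict Survived == 0 for everyone
mean(survived == 0)
# Naive model: predict Survived == 1 for everyone
mean(survived == 1)
# Coin-flip model: predict 1 with probability p = observed survival rate
p <- mean(survived)
predicted <- rbinom(891, 1, p)
mean(survived == predicted)
```

Note the first two scores always sum to 1: predicting all-0 and all-1 split the same observations between them.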
Load CSV’s into R:
library(tidyverse)
gender_submission <- readr::read_csv("assets/Titanic/gender_submission.csv")
test <- readr::read_csv("assets/Titanic/test.csv")
train <- readr::read_csv("assets/Titanic/train.csv")

readr:: just indicates the function is from the readr package. Use it for disambiguation.

The binary outcome variable Survived is included.
glimpse(train)

## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <chr> "male", "female", "female", "female", "male", "mal...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
Survived is now NOT included. There are 418 rows (passengers) you need to predict.
glimpse(test)

## Observations: 418
## Variables: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, ...
## $ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2,...
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Nee...
## $ Sex <chr> "male", "female", "male", "male", "female", "male"...
## $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18...
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0,...
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298",...
## $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7...
## $ Cabin <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B...
## $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", ...
This is what you submit: 418 rows with

- the ID variable PassengerId
- your predicted outcome variable Survived
glimpse(gender_submission)

## Observations: 418
## Variables: 2
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, ...
## $ Survived <int> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,...
You can write to CSV via:
gender_submission %>% readr::write_csv("assets/Titanic/submission.csv")

After submitting to Kaggle, you can see your ranking.
ggplot2, dplyr, modelr

Final Project: Kaggle (rhymes with haggle) competition.